Creating effective visualisations

Author

Ina Bornkessel-Schlesewsky

Published

January 20, 2023

Background

This document provides a brief introduction to creating effective visualisations, in addition to showing you (some aspects of) how to customise ggplots. Don’t feel that you have to memorise all of this information: the document is intended to serve as a reference and to give you some ideas about what is possible when it comes to the creation of plots.

Some of the following is adapted from a blogpost by Cédric Scherer under a Creative Commons Attribution 4.0 International licence.

For further details on colour scales, figure types and other considerations for creating effective visualisations, see Claus Wilke’s book Fundamentals of Data Visualization (2019, O’Reilly), online version available here.

Load required packages

library(tidyverse)
library(ggsci)
library(gapminder)

Example data set (already discussed in the context of data import)

Student-to-teacher ratios in different parts of the world:

Read in data and inspect

st_ratios_full <- read_csv("student_teacher_ratios.csv")

head(st_ratios_full)
# A tibble: 6 × 20
  indicator    country count…¹ eduli…²  year stude…³ flag_…⁴ flags name  alpha.2
  <chr>        <chr>   <chr>   <chr>   <dbl>   <dbl> <chr>   <chr> <chr> <chr>  
1 Primary Edu… Afghan… AFG     PTRHC_1  2017    44.0 <NA>    <NA>  Afgh… AF     
2 Primary Edu… Albania ALB     PTRHC_1  2017    17.9 <NA>    <NA>  Alba… AL     
3 Primary Edu… Algeria DZA     PTRHC_1  2017    24.2 <NA>    <NA>  Alge… DZ     
4 Primary Edu… Angola  AGO     PTRHC_1  2015    50.0 <NA>    <NA>  Ango… AO     
5 Primary Edu… Antigu… ATG     PTRHC_1  2017    12.1 <NA>    <NA>  Anti… AG     
6 Primary Edu… Argent… ARG     <NA>       NA    NA   <NA>    <NA>  Arge… AR     
# … with 10 more variables: alpha.3 <chr>, country.code <chr>,
#   iso_3166.2 <chr>, region <chr>, sub.region <chr>, region.code <chr>,
#   sub.region.code <chr>, x <dbl>, y <dbl>, student_ratio_region <dbl>, and
#   abbreviated variable names ¹​country_code, ²​edulit_ind, ³​student_ratio,
#   ⁴​flag_codes
glimpse(st_ratios_full)
Rows: 180
Columns: 20
$ indicator            <chr> "Primary Education", "Primary Education", "Primar…
$ country              <chr> "Afghanistan", "Albania", "Algeria", "Angola", "A…
$ country_code         <chr> "AFG", "ALB", "DZA", "AGO", "ATG", "ARG", "ARM", …
$ edulit_ind           <chr> "PTRHC_1", "PTRHC_1", "PTRHC_1", "PTRHC_1", "PTRH…
$ year                 <dbl> 2017, 2017, 2017, 2015, 2017, NA, NA, 2017, 2017,…
$ student_ratio        <dbl> 44.00995, 17.94478, 24.22505, 50.02951, 12.05576,…
$ flag_codes           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ flags                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ name                 <chr> "Afghanistan", "Albania", "Algeria", "Angola", "A…
$ alpha.2              <chr> "AF", "AL", "DZ", "AO", "AG", "AR", "AM", "AT", "…
$ alpha.3              <chr> "AFG", "ALB", "DZA", "AGO", "ATG", "ARG", "ARM", …
$ country.code         <chr> "004", "008", "012", "024", "028", "032", "051", …
$ iso_3166.2           <chr> "ISO 3166-2:AF", "ISO 3166-2:AL", "ISO 3166-2:DZ"…
$ region               <chr> "Asia", "Europe", "Africa", "Africa", "North Amer…
$ sub.region           <chr> "Southern Asia", "Southern Europe", "Northern Afr…
$ region.code          <chr> "142", "150", "002", "002", "019", "019", "142", …
$ sub.region.code      <chr> "034", "039", "015", "017", "029", "005", "145", …
$ x                    <dbl> 22, 15, 13, 13, 7, 6, 20, 15, 21, 4, 20, 23, 8, 1…
$ y                    <dbl> 8, 9, 11, 17, 4, 14, 6, 6, 7, 2, 9, 8, 6, 4, 5, 3…
$ student_ratio_region <dbl> 19.64278, 13.01069, 36.38758, 36.38758, 16.18269,…

Restrict to relevant columns:

st_ratios <- st_ratios_full |> 
  select(country,year,student_ratio,region)

st_ratios
# A tibble: 180 × 4
   country              year student_ratio region       
   <chr>               <dbl>         <dbl> <chr>        
 1 Afghanistan          2017          44.0 Asia         
 2 Albania              2017          17.9 Europe       
 3 Algeria              2017          24.2 Africa       
 4 Angola               2015          50.0 Africa       
 5 Antigua and Barbuda  2017          12.1 North America
 6 Argentina              NA          NA   South America
 7 Armenia                NA          NA   Asia         
 8 Austria              2017          10.0 Europe       
 9 Azerbaijan           2017          15.5 Asia         
10 Bahamas              2016          19.0 North America
# … with 170 more rows

Customisation for a more effective visualisation 1: basics

The student-teacher ratio data provide an interesting example, as we are interested in visualising both a measure of central tendency and the variability per region.

Let’s start with a bar / column graph. We start by summarising the data and then plot the graph. Can you recall why we need to summarise first before plotting in this way?

st_by_region <- st_ratios  |>  
  group_by(region) |> 
  summarise(
    median_st_ratio = median(student_ratio, na.rm=TRUE),
    mean_st_ratio = mean(student_ratio, na.rm=TRUE),
    sd_st_ratio = sd(student_ratio, na.rm=TRUE))

st_by_region |> 
  ggplot(aes(x = region, y = median_st_ratio)) +
  geom_col()

To make the graph more readable, we should sort the columns into some meaningful order. Recall that we can use fct_reorder() to do this; it allows us to order the levels of a categorical variable (region) by the values of another column (in this case, median_st_ratio).

st_by_region |> 
  ggplot(aes(x = fct_reorder(region,median_st_ratio), 
             y = median_st_ratio)) +
  geom_col()
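To see what fct_reorder() does to the factor levels, here is a minimal sketch on a toy vector. The numbers are made up for illustration and are not taken from the data set.

```r
library(forcats)

# Hypothetical values, for illustration only
region <- c("Africa", "Asia", "Europe")
median_ratio <- c(36, 20, 13)

# By default, factor levels are sorted alphabetically
default_levels <- levels(factor(region))

# fct_reorder() sorts the levels by increasing values of the second argument
ascending <- levels(fct_reorder(region, median_ratio))

# Negating the values reverses the order
descending <- levels(fct_reorder(region, -median_ratio))

print(default_levels)
print(ascending)
print(descending)
```

ggplot places the first level closest to the origin of the axis, so changing the level order changes the order of the columns.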

For graphs with long labels, flipping to a horizontal orientation is useful to improve readability. This is easily accomplished via coord_flip() (an alternative to switching x and y in the aesthetics).

st_by_region |> 
  ggplot(aes(x = fct_reorder(region,median_st_ratio), 
             y = median_st_ratio)) +
  geom_col() +
  coord_flip()
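The alternative mentioned above, swapping x and y in the aesthetics, can be sketched as follows. The toy data frame and its values are hypothetical and only serve to keep the example self-contained.

```r
library(ggplot2)

# Hypothetical summary values, for illustration only
toy <- data.frame(
  region = c("Africa", "Asia", "Europe"),
  median_ratio = c(36, 20, 13)
)

# Option 1: vertical specification, flipped afterwards
p_flip <- ggplot(toy, aes(x = region, y = median_ratio)) +
  geom_col() +
  coord_flip()

# Option 2: map the categorical variable to y directly
p_swap <- ggplot(toy, aes(x = median_ratio, y = region)) +
  geom_col()

p_flip
p_swap
```

Both plots should look the same. With the swapped version, x and y in labs() refer to the axes as they are displayed, which some people find easier to keep track of.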

In this case, we likely want to reorder so that the lowest value is at the top, given that this is the “best”. Note how we can do this by using a minus sign (-), just as in arrange(). We also add a title, remove the label for the region axis and add a more informative label for the ratio axis. Note that, when using coord_flip(), we still need to refer to the original (“unflipped”) axes, e.g. in labs(): the ratio label is set via y even though it is displayed on the horizontal axis.

st_by_region |> 
  ggplot(aes(x = fct_reorder(region,-median_st_ratio), 
             y = median_st_ratio)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
    y = "Median student-to-teacher ratio",
    x = ""
  )

Change the theme to further customise and add some colour to the bars. Note that, if you want to change aspects of the plot that don’t involve mapping aspects of the data to aspects of the visualisation, these specs don’t go into the aesthetics – as in the example below.

You can find all available colours in R using colours().

st_by_region |> 
  ggplot(aes(x = fct_reorder(region,-median_st_ratio), 
             y = median_st_ratio)) +
  geom_col(fill = "steelblue", alpha=0.8) +
  coord_flip() +
  labs(
    title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
    y = "Median student-to-teacher ratio",
    x = ""
  ) +
  theme_minimal()
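A short sketch of how the colours() vector mentioned above can be searched, for example to check that a colour name exists or to list all names containing "blue":

```r
# colours() returns a character vector of all built-in colour names
all_colours <- colours()
head(all_colours)

# Check whether a particular name is available
"steelblue" %in% all_colours

# List all colour names that contain "blue"
blues <- grep("blue", all_colours, value = TRUE)
head(blues)
```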

If we want to add a colour per region, we can do this and add a custom colour palette. To get rid of the legend, use guides(fill = "none") – this works for any aesthetic, so adapt e.g. for colour as needed.

st_by_region |> 
  ggplot(aes(x = fct_reorder(region,-median_st_ratio), 
             y = median_st_ratio, fill = region)) +
  geom_col(alpha = 0.8) +
  labs(
    title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
    x = "",
    y = "Median student-to-teacher ratio"
  ) +
  coord_flip() +
  theme_minimal() +
  scale_fill_brewer(palette = "Dark2") +
  guides(fill = "none")

For more information on colour palettes in R, see this website. Note that, for the use of some palettes, you will need to install additional packages.

Notes on colour palettes

This section draws on Wilke (2019).

There are three fundamental uses for colour in visualisations:

  • to distinguish groups of data
  • to represent data values
  • to highlight

Colour to distinguish groups

  • Colour is useful to distinguish discrete (unordered) items or groups, e.g. factor levels.
  • Use a qualitative colour scale for this:
    • finite set of specific colours that are chosen to be distinct but also equivalent
    • no one colour should stand out relative to the others
    • the scale should not create the impression of an order

Qualitative colour scales from the RColorBrewer package
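The qualitative palettes can also be inspected from the console. The following is a minimal sketch that assumes the RColorBrewer package is installed (it is normally installed alongside ggplot2).

```r
library(RColorBrewer)

# Table of all qualitative palettes: maximum number of colours and
# whether the palette is flagged as colourblind-friendly
qual_info <- brewer.pal.info[brewer.pal.info$category == "qual", ]
print(qual_info)

# Hex codes for six colours from the Set2 palette
set2 <- brewer.pal(n = 6, name = "Set2")
print(set2)

# Uncomment to draw all qualitative palettes in the plot window
# display.brewer.all(type = "qual")
```

The hex codes returned by brewer.pal() can be passed to scale_fill_manual(values = ...) if you need finer control than scale_fill_brewer() offers.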

Replot our graph from above using the RColorBrewer Set2 palette:

# scale_fill_brewer() is part of ggplot2, so no additional package is needed here

st_by_region |> 
  ggplot(aes(x = fct_reorder(region,-median_st_ratio), 
             y = median_st_ratio, fill = region)) +
  geom_col(alpha = 0.8) +
  labs(
    title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
    x = "",
    y = "Median student-to-teacher ratio"
  ) +
  coord_flip() +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  guides(fill = "none")

The Okabe Ito colour scale, a colourblind-friendly qualitative scale, is provided by the colorblindr package (see the package website).

To install this package, we need to take a slightly different approach, namely installing from a GitHub repository. This is accomplished using install_github() from the devtools package. We also need to install the packages cowplot and colorspace. As in previous sessions, make sure to install packages via the console (and you only need to do this once) – don’t include the code below in your .qmd document or you will likely have trouble knitting it.

# install.packages("devtools")
# devtools::install_github("wilkelab/cowplot")
# install.packages("colorspace")

# devtools::install_github("clauswilke/colorblindr")

Replot our graph from above using the Okabe Ito scale:

library(colorblindr)

st_by_region |> 
  ggplot(aes(x = fct_reorder(region,-median_st_ratio), 
             y = median_st_ratio, fill = region)) +
  geom_col(alpha = 0.8) +
  labs(
    title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
    x = "",
    y = "Median student-to-teacher ratio"
  ) +
  coord_flip() +
  theme_minimal() +
  scale_fill_OkabeIto() +
  guides(fill = "none")

Example for use of colour to distinguish groups

Note how we can also use fct_reorder() with mutate(), which means that we don’t need to include it in our ggplot() code. The final line of the chunk below shows how to increase text size in a plot.

gapminder |> 
  filter(year == 2007) |> 
  mutate(country = fct_reorder(country, pop)) |> 
  filter(pop > quantile(pop, probs = c(0.75))) |> 
  ggplot(aes(x = pop/1000000, y = country, fill = continent)) +
  geom_col() +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title = "Population in 2007",
    subtitle = "Countries in top quartile for population",
    x = "Population (millions)",
    y = "Country",
    fill = "Continent"
  ) +
  theme_bw() +
  theme(text = element_text(size = 12))

Using colour scales in ggplot

The following figure is from the ggplot2 cheatsheet.

knitr::include_graphics("images/ggplot_scales.png")

Colour to represent values

We can use a sequential colour scale to represent quantitative values, i.e. a sequence of colours that:

  • specifies which values are larger or smaller
  • indicates distance between values
  • is perceived to vary uniformly across entire range of values
  • can be based on single hue or multiple hues

Sequential colour scales from the RColorBrewer package
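As a small self-contained illustration of a sequential scale, the sketch below uses the faithfuld data set that ships with ggplot2 (not the data used in this document) and maps a continuous variable to fill with the viridis scale.

```r
library(ggplot2)

# faithfuld contains a 2D density estimate for the Old Faithful geyser data
p_seq <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  # sequential, multi-hue scale: darker colours for lower values,
  # brighter colours for higher values
  scale_fill_viridis_c() +
  labs(
    x = "Waiting time (min)",
    y = "Eruption duration (min)",
    fill = "Density"
  )

p_seq
```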

Example for use of colour to represent values

  • colours to represent values can be useful to show how values vary across geographic regions (a choropleth map)
  • this code, which is adapted from this tutorial on drawing maps in R (https://www.r-spatial.org/r/2018/10/25/ggplot2-sf.html), uses scale_fill_viridis_c()
  • note that the ggplot syntax for maps is a little different to what we have seen before: it uses geom_sf(); you don’t need to worry about the details of this unless you would like to use maps for your work
# note: to install multiple packages at once, use:
# install.packages(c("sf","rnaturalearth","rnaturalearthdata"))
# (rgeos has been retired from CRAN and is not needed for this example)

library(sf)
library(rnaturalearth)
library(rnaturalearthdata)

world <- ne_countries(scale = "medium", returnclass = "sf")

world |> 
  ggplot(aes(fill = pop_est)) +
  geom_sf() +
  scale_fill_viridis_c(option = "plasma", trans = "sqrt") +
  labs(
    title = "World map with population estimate",
    x = "Longitude",
    y = "Latitude",
    fill = "Population\nestimate"
  )

Diverging scales

Use a diverging scale to visualise values diverging from a midpoint (e.g. dataset containing positive and negative numbers):

  • essentially two sequential scales combined at a common midpoint (usually a light colour)
  • scale needs to be balanced so that progression from light to dark is perceived similarly in both directions

Diverging colour scales from the RColorBrewer package
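For continuous values that contain both positive and negative numbers, one option is scale_fill_gradient2(), which lets you set the midpoint explicitly. The sketch below uses made-up data; the two end colours are simply a pink and a green chosen for illustration.

```r
library(ggplot2)

# Hypothetical data with values on both sides of zero
toy <- data.frame(
  item = factor(LETTERS[1:7], levels = LETTERS[1:7]),
  change = c(-6, -4, -1, 0, 2, 3, 6)
)

p_div <- ggplot(toy, aes(x = change, y = item, fill = change)) +
  geom_col() +
  scale_fill_gradient2(
    low = "#C51B7D", mid = "white", high = "#4D9221",
    midpoint = 0,
    # symmetric limits keep the two halves of the scale balanced
    limits = c(-6, 6)
  ) +
  theme_bw()

p_div
```

Setting symmetric limits is a design choice: it ensures that equally large positive and negative values are drawn with equally strong colours.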

Example for use of a diverging scale

  • this plot uses scale_fill_brewer with palette PiYG
world |> 
  ggplot(aes(fill = income_grp)) +
  geom_sf() +
  scale_fill_brewer(palette = "PiYG") +
  labs(
    title = "Income group",
    x = "Longitude",
    y = "Latitude",
    fill = "Income group"
  )

Colour to highlight

Colour can be used to highlight specific elements, e.g. particular categories or values to emphasise.

  • Use an accent colour scale, which contains both:
    • a set of subdued colours
    • a matching set of stronger, darker or more saturated colours

Accent colour scale from the RColorBrewer package

Example for use of colour to highlight

  • this plot uses scale_colour_brewer with palette Dark2
  • and the gghighlight package
library(gghighlight)

gapminder |> 
  group_by(continent,year) |> 
  summarise(lifeExp = mean(lifeExp)) |> 
  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
  geom_line(linewidth = 1.5) +
  gghighlight(continent %in% c("Oceania", "Asia")) +
  theme_bw() +
  scale_colour_brewer(palette = "Dark2") +
  theme(text = element_text(size = 20))  +
  labs(
    title = "Life expectancy by year and continent",
    x = "Year",
    y = "Life expectancy",
    colour = "Continent"
  )

Example: Brewer colour scales

  • from ?scale_colour_brewer

Customisation for a more effective visualisation 2: adding variability information

Back to our example … Note that much of the following is based on Cédric Scherer’s blog (see link above).

What if we wanted to add variability information?

We could use a boxplot:

st_ratios |> 
  ggplot(aes(x = region, y = student_ratio)) +
  geom_boxplot()

Let’s start by making some modifications to make the plot more readable in line with the above considerations. To be able to order the boxplot, we need to add a variable with a regional summary statistic – we will use the median here. Note the use of mutate() in conjunction with group_by() to create a new variable with group-based values.

Note also the use of a custom colour scale from the University of Chicago via the ggsci package and how we customise the limits for the x axis (flipped from y).

st_ordered <- st_ratios |> 
  group_by(region) |> 
  mutate(st_by_region = median(student_ratio, na.rm=TRUE)) |> 
  ungroup() |> 
  mutate(region = fct_reorder(region, -st_by_region))

st_ordered |> 
  ggplot(aes(x = region, y = student_ratio, fill = region)) +
  geom_boxplot() +
  theme_light() +
  scale_y_continuous(limits = c(0, 90)) +
  coord_flip() +
  scale_fill_uchicago() +
  labs(
    title = "Student-teacher ratios are highest and most variable in Africa",
    x = "",
    y = "Student-to-teacher ratio"
  ) +
  guides(fill = "none")
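The difference between summarise() and mutate() after group_by() can be seen on a toy data set. This is a minimal sketch with made-up values: summarise() returns one row per group, whereas mutate() keeps every row and repeats the group value.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- tibble(
  region = c("A", "A", "B", "B", "B"),
  student_ratio = c(10, 20, 30, NA, 50)
)

# One row per region
per_region <- toy |>
  group_by(region) |>
  summarise(region_median = median(student_ratio, na.rm = TRUE))

# Same number of rows as the input; the median is repeated within each region
with_median <- toy |>
  group_by(region) |>
  mutate(region_median = median(student_ratio, na.rm = TRUE)) |>
  ungroup()

print(per_region)
print(with_median)
```

Keeping all rows is what allows us to plot the individual countries while still ordering the regions by their median.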

Now add some other adjustments … Note how we can create a new object g with the basic setup of our plot, to which we can subsequently add various geometries:

library(showtext)
font_add_google("Poppins", "Poppins")
font_add_google("Roboto Mono", "Roboto Mono")
showtext_auto()

theme_set(theme_light(base_size = 18, base_family = "Poppins"))

g <-
  st_ordered |> 
  ggplot(aes(x = region, y = student_ratio, colour = region)) +
    coord_flip() +
    scale_y_continuous(limits = c(0, 90), expand = c(0.005, 0.005)) +
    scale_color_uchicago() +
    labs(x = NULL, y = "Student to teacher ratio") +
    theme(
      legend.position = "none",
      axis.title = element_text(size = 16),
      axis.text.x = element_text(family = "Roboto Mono", size = 12),
      panel.grid = element_blank()
    )

Look at varying geometries:

g + geom_boxplot()

g + geom_violin()

g + geom_line(linewidth = 1)

g + geom_point(size = 1)

g + geom_jitter(size = 1)

We can also combine geoms for more information! Note that the alpha parameter varies transparency: it ranges from 0 (fully transparent) to 1 (fully opaque). In geom_boxplot() the outlier.alpha parameter varies the transparency of the outlier points: by setting it to 0 we effectively hide these and ensure that we don’t get an overlap with the points drawn by geom_jitter().

g + geom_boxplot(outlier.alpha = 0) +
  geom_jitter(size = 2, alpha = 0.3)

For a more intuitive visualisation, we use geom_jitter() combined with the mean per region. We can get ggplot to compute this for us using stat_summary(). With set.seed(), we ensure that the “random” dot distribution produced by geom_jitter() is reproducible – this will become important when adding labels later.

set.seed(2019)
g +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 5)
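As an alternative to calling set.seed() before plotting, position_jitter() has its own seed argument. The sketch below uses made-up data and geom_point() with an explicit jitter position; it is one possible approach and not the one used in the rest of this document.

```r
library(ggplot2)

# Hypothetical data, for illustration only
toy <- data.frame(
  region = rep(c("A", "B"), each = 20),
  student_ratio = rep(seq(10, 48, by = 2), times = 2)
)

p_jitter <- ggplot(toy, aes(x = region, y = student_ratio)) +
  geom_point(
    # the seed is stored with the position, so the jitter is the same
    # every time the plot is built
    position = position_jitter(width = 0.2, height = 0, seed = 2019),
    size = 2, alpha = 0.25
  )

# Build the plot twice and compare the jittered x positions
x_first <- layer_data(p_jitter)$x
x_second <- layer_data(p_jitter)$x
print(identical(x_first, x_second))

p_jitter
```

The advantage of this variant is that the seed travels with the plot object, so reproducibility does not depend on where set.seed() was last called.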

Add a line for world average:

world_avg <-
  st_ordered |>
  summarise(avg = mean(student_ratio, na.rm = TRUE)) |>
  # pull ensures that we have only a single value not a data frame
  pull(avg)

g +
  geom_hline(aes(yintercept = world_avg), colour="grey70", linewidth=0.6) +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 5)
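The comment in the chunk above notes that pull() gives us a single value rather than a data frame. A minimal sketch of the difference, using made-up numbers:

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- tibble(student_ratio = c(10, 20, NA, 30))

# summarise() always returns a data frame (here: 1 row, 1 column)
avg_df <- toy |>
  summarise(avg = mean(student_ratio, na.rm = TRUE))

# pull() extracts the column as a plain vector
avg_value <- avg_df |>
  pull(avg)

print(is.data.frame(avg_df))
print(is.numeric(avg_value))
print(avg_value)
```

A plain number is convenient because it can be used directly in later calculations and in text, e.g. inside round().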

Now add lines from the world average to each region's summary value (note that st_by_region, as computed above, holds the regional median):

g +
  geom_segment(
    aes(x = region, xend = region,
        y = world_avg, yend = st_by_region),
    linewidth = 0.8
  ) +
  geom_hline(aes(yintercept = world_avg), colour="grey70", linewidth=0.6) +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 5)
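In the code above, yend = st_by_region refers to the column that was created with median(), while the large dots show the mean. If you would like the segments to end exactly at the mean dots, one option is to add a column with the regional mean. The sketch below uses made-up data, and the column name region_mean is hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data, for illustration only
toy <- tibble(
  region = rep(c("A", "B"), each = 3),
  student_ratio = c(10, 12, 30, 20, 40, 45)
)

# Add the regional mean to every row
toy_means <- toy |>
  group_by(region) |>
  mutate(region_mean = mean(student_ratio, na.rm = TRUE)) |>
  ungroup()

overall_avg <- mean(toy$student_ratio, na.rm = TRUE)

p_segments <- ggplot(toy_means, aes(x = region, y = student_ratio, colour = region)) +
  # segments run from the overall average to each regional mean
  geom_segment(
    aes(xend = region, y = overall_avg, yend = region_mean),
    linewidth = 0.8
  ) +
  geom_hline(yintercept = overall_avg, colour = "grey70", linewidth = 0.6) +
  geom_point(size = 2, alpha = 0.25) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  coord_flip() +
  guides(colour = "none")

p_segments
```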

And finally … add some text:

g +
  geom_segment(
    aes(x = region, xend = region,
        y = world_avg, yend = st_by_region),
    linewidth = 0.8
  ) +
  geom_hline(aes(yintercept = world_avg), colour="grey70", linewidth=0.6) +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  annotate(
    "text", x = 6.3, y = 35, family = "Poppins", size = 2.8, color = "gray20", lineheight = .9,
    label = glue::glue("Worldwide average:\n{round(world_avg, 1)} students per teacher")
  ) 

See the blogpost for further details on how to add more text as well as arrows.

Customisation for a more effective visualisation 3: an alternative way to visualise distributions

We could also plot density ridges!

For an introduction to geom_density_ridges() from the ggridges package, see https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html. Density ridges build on the basic geom_density() in ggplot2.

Also note the addition of a caption to acknowledge the source of the data.

library(ggridges)

st_ordered |> 
  ggplot(aes(x = student_ratio, y = region, fill = region, colour = region)) +
  geom_density_ridges(
    alpha = 0.5, rel_min_height = 0.001, scale = 0.9,
    quantile_lines = TRUE, quantiles = 2
  ) +
  scale_color_uchicago() +
  scale_fill_uchicago() +
  labs(
    y = NULL, 
    x = "Student to teacher ratio",
    caption = "Source: UNESCO"
  ) +
  theme(
    legend.position = "none",
    axis.title = element_text(size = 16),
    axis.text.x = element_text(family = "Roboto Mono", size = 12),
    panel.grid = element_blank()
  )